Abstract. We calculate the storage capacity of a perceptron for correlated Gaussian
patterns. We find that the storage capacity $\alpha_c$ can be less than 2 if similar patterns
are mapped onto different outputs and vice versa. As long as the patterns are in
general position we obtain, in contrast to previous works, that $\alpha_c \ge 1$, in agreement
with Cover's theorem. Numerical simulations confirm the results.

The critical storage capacity of a simple perceptron for randomly chosen
input/output pairs is known to be $\alpha_c = p/N = 2$, with $p$ the number of stored patterns
and $N$ their input dimension. This result was first derived by Cover (1965) using a
geometrical argument and later by Gardner (1988) and Gardner and Derrida (1988) by
calculating the fractional phase space volume of consistent couplings with the tools of
statistical mechanics and the replica approach.

Cover's theorem states that, as long as the patterns $\vec\xi^\mu$ ($\mu = 1,\dots,p$) are in general
position (no subset of $N$ or fewer patterns is linearly dependent), the critical storage capacity
$\alpha_c$ is at least 1, independently of the corresponding outputs $s^\mu$. For $\alpha$ between 1 and
2 the fraction of output combinations which are not linearly separable is exponentially
small in $N$. The converse holds for $\alpha$ larger than 2. In this case a randomly chosen
sequence of outputs will not be linearly separable with a probability approaching 1
as $N \to \infty$. This is even true if correlations among the patterns $\vec\xi^\mu$ are introduced
(Monasson 1992).
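Cover's counting argument can be illustrated numerically. For $p$ patterns in general position in $N$ dimensions, the number of linearly separable output assignments is $2\sum_{k=0}^{N-1}\binom{p-1}{k}$ (Cover 1965). The sketch below evaluates the corresponding fraction; the function name `separable_fraction` and the sample sizes are illustrative choices, not taken from the letter.

```python
from math import comb

def separable_fraction(p: int, n: int) -> float:
    """Fraction of the 2**p output assignments that are linearly separable
    for p patterns in general position in n dimensions (Cover's count)."""
    count = 2 * sum(comb(p - 1, k) for k in range(n))
    return count / 2 ** p

if __name__ == "__main__":
    n = 100
    for alpha in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
        p = int(alpha * n)
        print(f"alpha = {alpha:.1f}  fraction = {separable_fraction(p, n):.6f}")
```

The fraction is exactly 1 for $p \le N$, exactly 1/2 at $p = 2N$, and the crossover around $\alpha = 2$ sharpens as $N$ grows.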

One would argue that in general correlations which include the outputs lead to higher
critical storage capacities, as is for instance the case for biased patterns (Gardner 1988).
Recently it has been found (Bork 1994, Schroder et al 1995) that patterns and outputs
extracted from a bit-sequence seem to lead to smaller storage capacities than $\alpha_c = 2$ for
a perceptron. The bit-sequence is thought of as an infinite time series $\zeta_i$ ($i = 1, 2, \dots$) in
which the first pattern results from the first $N$ values $\zeta_i$ ($i = 1,\dots,N$) and its output
from the $(N+1)$th value $\zeta_{N+1}$ (or $\mathrm{sign}(\zeta_{N+1})$ for continuous valued $\zeta_i$). By moving
this $(N+1)$-broad window one step forward, the second pattern and its corresponding
output result, and so on. If the $\zeta_i$ are drawn at random from a distribution with
$\langle\zeta_i\rangle = 0$, $\langle\zeta_i^2\rangle = 1$, one finds that the critical storage capacity of a perceptron which stores
the resulting input/output pairs $(\vec\xi^\mu, s^\mu)$ is $\alpha_c \approx 1.82$ for binary $\zeta_i$ and $\alpha_c \approx 1.88$ for
gaussian $\zeta_i$. This result indicates that the embedded correlations between input/output
pairs resulting from the bit-sequence lead to a reduction of the storage capacity when
compared to randomly chosen pairs and are harder to implement in a perceptron.
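The sliding-window construction can be sketched in a few lines. The helper name `window_patterns` and the convention that a value of exactly zero maps to the output $+1$ are assumptions made for this illustration only.

```python
import random

def window_patterns(seq, n):
    """Slide an (n+1)-wide window over seq. Each window yields one pair:
    the first n values form the input pattern, the sign of the next value
    is the output."""
    pairs = []
    for start in range(len(seq) - n):
        xi = list(seq[start:start + n])
        s = 1 if seq[start + n] >= 0 else -1
        pairs.append((xi, s))
    return pairs

if __name__ == "__main__":
    rng = random.Random(1)
    series = [rng.choice((-1, 1)) for _ in range(12)]  # binary sequence
    for xi, s in window_patterns(series, 4):
        print(xi, "->", s)
```

Consecutive patterns share $N-1$ components, and each output reappears as a component of later patterns, which is the source of the input/output correlations discussed above.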

Taking the former as a motivation, we will investigate in the present letter the
effect of correlations between input/output pairs on the critical storage capacity of
a perceptron. The main idea is that the storage of similar patterns with different
outputs should be more difficult to implement in a perceptron than the case of similar
patterns with identical outputs. If we introduce the transformed patterns $\vec\sigma^\mu = s^\mu \vec\xi^\mu$,
this similarity or dissimilarity can be described with a positive or negative overlap
($R = N^{-1}\,\vec\sigma^\mu \cdot \vec\sigma^\nu$) respectively between two transformed patterns $\mu$ and $\nu$. Without
loss of generality we fix all outputs to have the value $s^\mu = +1$ ($\mu = 1,\dots,p$), so that
$\vec\sigma^\mu = \vec\xi^\mu$. Let us now take pairs of patterns with a fixed overlap $R$. With a normalization
such that $\vec\xi^\mu \cdot \vec\xi^\mu = N$, we have:

$$ \vec\xi^{2\mu-1} \cdot \vec\xi^{2\mu} = N R \qquad \text{for } \mu = 1,\dots,p/2 \tag{1} $$

with $|R| \le 1$. Apart from these fixed overlaps two arbitrary patterns shall not be
correlated and thus will have an overlap of $N^{-1}\,\vec\xi^\mu \cdot \vec\xi^\nu = 0$ in the thermodynamic limit
$N \to \infty$.
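For numerical experiments, one simple way to produce a pair with a prescribed overlap is to mix two independent Gaussian vectors, $\vec\xi^{2\mu} = R\,\vec\xi^{2\mu-1} + \sqrt{1-R^2}\,\vec\eta$. This is an assumed construction for illustration (the letter does not specify how its simulation patterns were generated), and it fixes the overlap and the normalization only up to fluctuations of order $N^{-1/2}$.

```python
import math
import random

def correlated_pair(n, r, rng):
    """Return two Gaussian vectors of length n whose expected overlap
    N^-1 xi1.xi2 equals r (assumed construction, see lead-in)."""
    xi1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = math.sqrt(1.0 - r * r)
    xi2 = [r * a + w * b for a, b in zip(xi1, eta)]
    return xi1, xi2

def overlap(a, b):
    """Normalized overlap N^-1 a.b."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    rng = random.Random(0)
    xi1, xi2 = correlated_pair(10000, -0.6, rng)
    print("measured overlap:", overlap(xi1, xi2))
```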

Before evaluating the storage capacity for a general $R$ let us consider some special
cases of interest. For $R = 1$ the two patterns of a pair are identical (for $N \to \infty$), so
that the storage of the first pattern automatically implements the second one, and
hence $\alpha_c(R = 1) = 4$. For $R = 0$ no correlations are present and $\alpha_c(R = 0) = 2$ as
usual. For $R = -1$ the patterns are not in general position, since $p/2$ of them are
pairwise linearly dependent. Already the first two patterns cannot be implemented by a
perceptron, so that $\alpha_c(R = -1) = 0$. However, if $R$ is very close to $-1$ general position
is guaranteed and thus $\alpha_c(R = -1 + \epsilon) \ge 1$ ($\epsilon > 0$) according to Cover's theorem. We
will investigate this case later in more detail.

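Following Gardner's approach we consider the fractional phase space volume $V$ of
couplings $\vec J$ that are consistent with the constraints imposed by the patterns (version
space):

$$ V = V_{\rm tot}^{-1} \int \prod_{j=1}^{N} dJ_j \, \prod_{\mu=1}^{p} \Theta\!\left( \frac{1}{\sqrt N}\, s^\mu \sum_{j=1}^{N} J_j \xi_j^\mu - \kappa \right) \delta\!\left( \sum_{j=1}^{N} J_j^2 - N \right) \tag{2} $$

with

$$ V_{\rm tot} = \int \prod_{j=1}^{N} dJ_j \; \delta\!\left( \sum_{i=1}^{N} J_i^2 - N \right). \tag{3} $$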
It is clear that for negative $R$ the reduction of the version space by every new pair of
patterns is more drastic than for $R = 0$, and so we expect $\alpha_c$ to be less than 2.

We now fix $s^\mu = +1$ ($\mu = 1,\dots,p$) and introduce the conditions (1) via delta
functions. The average $\langle\langle \ln V \rangle\rangle_\xi$ will be performed by means of the replica trick and is
defined by:

$$ \langle\langle (\cdots) \rangle\rangle_\xi = C_N \int \prod_{\mu=1}^{p} \prod_{i=1}^{N} D\xi_i^\mu \; (\cdots) \prod_{\mu=1}^{p/2} \delta\!\left( N^{-1}\, \vec\xi^{2\mu-1} \cdot \vec\xi^{2\mu} - R \right) \tag{4} $$

where $C_N$ is the normalization constant resulting from $\langle\langle 1 \rangle\rangle_\xi = 1$, and $Dx =
dx \exp(-x^2/2)/\sqrt{2\pi}$ is the Gaussian measure. As usual in the calculation the order
parameter $q_{\alpha\beta}$ appears, which measures the overlap between two replicas. Making the
replica symmetric ansatz $q_{\alpha\beta} = q$ ($\forall \alpha < \beta$) and using $\langle\langle \ln V \rangle\rangle_\xi = \lim_{n \to 0} n^{-1} \ln \langle\langle V^n \rangle\rangle_\xi$
together with the saddle point method to evaluate the integrals in the thermodynamic
limit, one obtains:

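$$ \frac{1}{N} \langle\langle \ln V \rangle\rangle_\xi = \mathop{\rm Extr}_{q} \left\{ \frac{\alpha}{2} \int Dx \int Dz \, \ln f(q, R, \kappa, x, z) + \frac{1}{2} \ln(1 - q) + \frac{q}{2(1-q)} \right\} \tag{5} $$

with

$$ f(q, R, \kappa, x, z) = \sqrt{1 - R^2} \int_{\gamma}^{\infty} Dw \int_{\delta}^{\infty} Du \, \exp(R u w) \tag{6} $$

where

$$ \gamma = \frac{\kappa + \sqrt{q}\, x}{\sqrt{(1-q)(1-R^2)}}, \qquad \delta = \frac{\kappa + \sqrt{q}\left( R x + \sqrt{1 - R^2}\, z \right)}{\sqrt{(1-q)(1-R^2)}}. \tag{7} $$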

The critical storage capacity is reached when the version space shrinks to a single point
and thus $q$ reaches unity. From the extremum condition in (5) we obtain for the critical
storage capacity $\alpha_c$

$$ \alpha_c^{-1}(R, \kappa) = \lim_{q \to 1} \left\{ -(1 - q)^2 \, \frac{\partial}{\partial q} \int Dx \int Dz \, \ln f(q, R, \kappa, x, z) \right\} \tag{8} $$

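In the limit $q \to 1$ one has to consider the cases of positive and negative $R$ separately,
yielding

$$ \alpha_c^{-1}(R, \kappa) = \int_{-\infty}^{-\kappa}\! Dx \int_{z_0}^{\infty}\! Dz \, \frac{b^2}{2} + \int_{-\kappa}^{\infty}\! Dx \int_{-\infty}^{z_1}\! Dz \, \frac{a^2}{2} + \int_{-\kappa}^{\infty}\! Dx \int_{z_2}^{\infty}\! Dz \, \frac{b^2}{2} + \int_{-\kappa}^{\infty}\! Dx \int_{z_1}^{z_2}\! Dz \, \frac{a^2 - 2Rab + b^2}{2(1 - R^2)} \qquad (R > 0) $$

$$ \alpha_c^{-1}(R, \kappa) = \int_{-\infty}^{-\kappa}\! Dx \int_{z_0}^{z_2}\! Dz \, \frac{b^2}{2} + \int_{-\kappa}^{\infty}\! Dx \int_{-\infty}^{z_1}\! Dz \, \frac{a^2}{2} + \int_{-\infty}^{-\kappa}\! Dx \int_{z_2}^{\infty}\! Dz \, \frac{a^2 - 2Rab + b^2}{2(1 - R^2)} + \int_{-\kappa}^{\infty}\! Dx \int_{z_1}^{\infty}\! Dz \, \frac{a^2 - 2Rab + b^2}{2(1 - R^2)} \qquad (R < 0) \tag{9} $$

with $a = \kappa + x$, $b = \kappa + Rx + \sqrt{1 - R^2}\, z$ and

$$ z_0 = -\frac{\kappa + Rx}{\sqrt{1 - R^2}}, \qquad z_1 = -\frac{\kappa (1 - R)}{\sqrt{1 - R^2}}, \qquad z_2 = \frac{\kappa (1 - R) + x (1 - R^2)}{R \sqrt{1 - R^2}}. $$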

For the interesting case $\kappa = 0$ this expression reduces to a surprisingly simple form:

$$ \alpha_c^{-1}(R, \kappa = 0) = \frac{1}{4} + \frac{\theta}{2\pi}, \qquad \theta = \arccos R \tag{10} $$

Here $\theta$ is the angle between two correlated patterns, and (10) holds for $-1 < R < 1$.
The resulting curve is plotted in figure 1. The simulations have been carried out in
two ways: First, one can calculate the average probability that a given set of patterns
with correlations as described above is linearly separable, for different values of $\alpha$. The
condition for the critical capacity is that this probability equals 1/2. The second method
is to assume that the median learning time $\tau$ (for the perceptron learning rule) scales as
$\tau^{-0.5} \propto (\alpha_c - \alpha)$ for $\alpha \to \alpha_c$. This can be proven for uncorrelated patterns (Opper
1988) and has been used by Priel et al (1994) to determine the critical capacity by
extrapolating the curve to $\tau \to \infty$. The inset of figure 1 shows clearly that the mentioned
scaling law is obeyed in our case too.
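The second method amounts to a straight-line fit of $\tau^{-1/2}$ against $\alpha$ and reading off where the line crosses zero. A minimal sketch of that extrapolation follows, assuming an ordinary unweighted least-squares fit; the function name `estimate_alpha_c` and the synthetic learning times in the example are illustrative and are not simulation data from the letter.

```python
def estimate_alpha_c(alphas, median_times):
    """Fit tau**-0.5 = slope * alpha + intercept by least squares and
    return the alpha at which the fitted line reaches zero."""
    ys = [t ** -0.5 for t in median_times]
    n = len(alphas)
    mean_x = sum(alphas) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in alphas)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alphas, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -intercept / slope

if __name__ == "__main__":
    # Synthetic times obeying the scaling law exactly, with alpha_c = 1.4.
    alphas = [1.0, 1.1, 1.2, 1.3]
    times = [(0.5 * (1.4 - a)) ** -2 for a in alphas]
    print("estimated alpha_c:", estimate_alpha_c(alphas, times))
```

With real median learning times the points scatter around the line, so the estimate depends on which range of $\alpha$ is included in the fit.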

As we expected, for negative $R$ the storage capacity lies below 2, approaching the
value of $\alpha_c = 4/3$ for $R \to -1$. This result can be understood as follows. Every pair
$\mu = 1,\dots,p/2$ of correlated patterns defines a vector $\vec L^\mu = \vec\xi^{2\mu-1} - \vec\xi^{2\mu}$. For $R \to -1$ the
coupling vector $\vec J$ falls for every $\mu$ into a half-hyperplane orthogonal to $\vec L^\mu$. This leads
to the constraints:

$$ (1)\quad \vec L^\mu \cdot \vec J = 0, \qquad (2)\quad \vec T^\mu \cdot \vec J > 0, \qquad \mu = 1,\dots,p/2 \tag{11} $$

for $\vec J$, where $\vec T^\mu$ is a vector orthogonal to $\vec L^\mu$. The second constraint defines a new
perceptron problem with $p/2$ uncorrelated patterns $\vec T^\mu$ for a coupling vector $\vec J$ with an
effective number of dimensions of $(N - p/2)$ due to the first constraint. According to
Cover's theorem $(p/2)/(N - p/2) = 2$ and thus $\alpha_c = p/N = 4/3$.

Up to now we have assumed that the overlap $R$ is the same for all pairs. The case
where two patterns have an overlap $R$ with a probability $P(R)$ can also be handled.
The averages over different values of $R$ factorize and the storage capacity is simply given
by

$$ \alpha_c^{-1}(\kappa = 0) = \int_{-\infty}^{\infty} dR \, P(R) \left( \frac{1}{4} + \frac{\theta}{2\pi} \right) = \int_{-\infty}^{\infty} dR \, P(R) \, \alpha_c^{-1}(R, \kappa = 0) \tag{12} $$

The last equality states that the reciprocal values of the capacity for a fixed value
of the overlap, weighted with their corresponding probability, sum up to the reciprocal
of the total capacity. If the distribution is symmetric ($P(R) = P(-R)$) the storage
capacity is $\alpha_c = 2$. This is immediately clear if one thinks of a correlation with a
primary distribution $P_1(R)$ and in addition chooses outputs $s^\mu$ at random $\pm 1$ instead
of taking them all equal to $+1$. The new distribution of the correlations is then given
by $P(R) = P_1(R)/2 + P_1(-R)/2$, which is symmetric and will lead in (12) to $\alpha_c = 2$ as
it should, since we have taken the outputs to be random.
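The $\kappa = 0$ results can be checked with a few lines of code. The sketch below evaluates $\alpha_c^{-1}(R) = 1/4 + \arccos(R)/(2\pi)$ and the weighted sum of reciprocals for a discrete overlap distribution; representing $P(R)$ as a list of `(probability, R)` pairs is an assumption made here for simplicity.

```python
import math

def inv_alpha_c(r):
    """Reciprocal critical capacity for pairs with overlap r at kappa = 0."""
    return 0.25 + math.acos(r) / (2.0 * math.pi)

def alpha_c_mixture(weighted_overlaps):
    """Capacity for a discrete overlap distribution given as
    (probability, R) pairs whose probabilities sum to 1."""
    return 1.0 / sum(p * inv_alpha_c(r) for p, r in weighted_overlaps)

if __name__ == "__main__":
    for r in (1.0, 0.5, 0.0, -0.5, -0.95):
        print(f"R = {r:+.2f}  alpha_c = {1.0 / inv_alpha_c(r):.4f}")
    # A symmetric distribution gives back the uncorrelated value.
    print(alpha_c_mixture([(0.5, 0.7), (0.5, -0.7)]))
```

The symmetric case works because $\arccos(R) + \arccos(-R) = \pi$, so the two reciprocals average to exactly 1/2.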

Let us now turn to the case where more than two patterns are correlated among
each other. In the case of three patterns with equal pairwise overlap $R$ we can conclude
that for $R = 1$ the storage capacity is $\alpha_c = 6$ for $\kappa = 0$. If $R$ tends to $-1/2$, which is the
minimal accessible value in this case, a geometrical argument similar to (11) leads us
to $(p/3)/(N - 2p/3) = 2$ and thus $\alpha_c = p/N = 6/5$. The calculation of $\alpha_c(R)$ for other
values of $R$ should be more complicated than in the former case, since to disentangle
the additional correlations one has to introduce more gaussian fields. In general, if the
patterns are correlated in tuples of $m$ it follows in the same way:

$$ \alpha_c(R = 1, \kappa = 0) = 2m $$

$$ \alpha_c(R = 0, \kappa = 0) = 2 $$

$$ \alpha_c\!\left( R \to -\frac{1}{m-1}, \kappa = 0 \right) = \frac{2m}{2m - 1} \tag{13} $$

For $m \to \infty$, $\alpha_c$ tends to 1 as $R$ approaches the minimal possible value $-1/(m-1)$.
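The counting behind the last line of (13) can be verified with exact rational arithmetic. The sketch assumes the generalization of the argument for pairs and triples: each tuple of $m$ patterns removes $m-1$ dimensions and leaves one effective pattern, so that $(p/m)/(N - (m-1)p/m) = 2$. The function names are illustrative.

```python
from fractions import Fraction

def alpha_c_min_overlap(m):
    """Capacity as R approaches -1/(m-1), from (p/m)/(N-(m-1)p/m) = 2."""
    return Fraction(2 * m, 2 * m - 1)

def satisfies_cover_condition(m, alpha):
    """Check (alpha/m) / (1 - (m-1)*alpha/m) == 2, i.e. the reduced
    problem sits exactly at Cover's capacity of 2."""
    effective_patterns = alpha / m
    effective_dimensions = 1 - (m - 1) * alpha / m
    return effective_patterns / effective_dimensions == 2

if __name__ == "__main__":
    for m in (2, 3, 10, 100):
        alpha = alpha_c_min_overlap(m)
        print(m, alpha, float(alpha), satisfies_cover_condition(m, alpha))
```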

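As long as $m$ is of the order 1 compared to the number of patterns $p$, we will have
to calculate $\alpha_c(R)$ as for $m = 2$, which for larger $m$ becomes very complicated. If $m$
however is of the order $p^\gamma$ with $0 < \gamma < 1$ we are able to proceed in a much simpler
way. For this purpose we first introduce as in (Fontanari and Meir 1989) the correlation
matrix $C$ for the patterns:

$$ C_{\mu\nu} = \langle \xi_i^\mu \xi_i^\nu \rangle \qquad \forall \mu, \nu = 1,\dots,p \tag{14} $$

The $\vec\xi_i = (\xi_i^1, \dots, \xi_i^p)$ are then distributed as

$$ P(\vec\xi_i) = \frac{1}{\sqrt{(2\pi)^p \det C}} \exp\!\left( -\frac{1}{2}\, \vec\xi_i^{\,T} C^{-1} \vec\xi_i \right) \tag{15} $$

The class of correlations we have considered is described by

$$ C_{\mu\nu} = \begin{cases} R & \text{for } \mu = ma - \tau,\ \nu = ma - \epsilon,\ a = 1,\dots,p/m \\ & \text{with } \tau, \epsilon = 0,\dots,(m-1)\ (\tau \ne \epsilon) \\ 1 & \text{for } \mu = \nu \\ 0 & \text{else} \end{cases} \tag{16} $$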
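The block structure of this correlation matrix also explains the lower bound on $R$: each $m \times m$ block has one uniform eigenvector with eigenvalue $1 + (m-1)R$ and $m-1$ difference eigenvectors with eigenvalue $1 - R$, so $C$ is positive definite only for $-1/(m-1) < R < 1$. The sketch below builds the matrix and checks both eigenvectors directly; it uses 0-based pattern indices and helper names chosen for this illustration.

```python
def block_correlation_matrix(p, m, r):
    """C[mu][nu] = 1 on the diagonal, r inside each block of m
    consecutive patterns, 0 elsewhere (0-based indices)."""
    return [[1.0 if mu == nu else (r if mu // m == nu // m else 0.0)
             for nu in range(p)] for mu in range(p)]

def matvec(c, v):
    """Plain matrix-vector product."""
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in c]

def block_eigenvalues(m, r):
    """Eigenvalues of one block: (uniform mode, difference modes)."""
    return 1.0 + (m - 1) * r, 1.0 - r

if __name__ == "__main__":
    p, m, r = 6, 3, -0.4
    c = block_correlation_matrix(p, m, r)
    uniform = [1.0] * m + [0.0] * (p - m)
    difference = [1.0, -1.0] + [0.0] * (p - 2)
    print(matvec(c, uniform), matvec(c, difference))
    print(block_eigenvalues(m, r))
```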



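The average over the fractional phase space of couplings is now performed with (15)
and results for $m \to \infty$ in the thermodynamic limit in

$$ \frac{1}{N} \langle\langle \ln V \rangle\rangle_\xi = \mathop{\rm Extr}_{q} \left\{ \alpha \mathop{\rm Extr}_{r, \hat r} \left[ -i r \hat r - \frac{1}{2} c\, (1 - q)\, r^2 + \int Dt \, \ln H(\omega) \right] + \frac{1}{2} \ln(1 - q) + \frac{q}{2(1-q)} \right\} \tag{17} $$

with

$$ \omega = \frac{\kappa + \hat r + \sqrt{q}\, t}{\sqrt{1 - q}}, \qquad H(x) = \int_{x}^{\infty} Dt \tag{18} $$

and $c = Rm$ in the limit $m \to \infty$, so $-1 < c < \infty$ since $-(m-1)^{-1} < R < 1$. We
have made the replica symmetric ansatz for $q_{\alpha\beta}$ and the additional order parameter
$r_\alpha = m^{-1} \sum_{\mu=1}^{m} x_\alpha^\mu$, where $x_\alpha^\mu$ are the conjugate variables to the local fields $\lambda_\alpha^\mu$, and its
conjugate $\hat r_\alpha$.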

If we solve (17) for the extremum we first can write $r$ in terms of $\hat r$ and find in the
limit $q \to 1$

$$ -\frac{\hat r}{c} = (\kappa + \hat r)\, H(-\kappa - \hat r) + \frac{1}{\sqrt{2\pi}} \exp\!\left\{ -\frac{1}{2} (\kappa + \hat r)^2 \right\} \tag{19} $$

$$ \alpha_c^{-1}(c, \kappa) = H(-\kappa - \hat r) - \frac{\kappa\, \hat r}{c} \tag{20} $$

From the first equation we can obtain $\hat r$ numerically and plug it into the second to find
$\alpha_c$. Figure 2 shows the storage capacity as a function of $c$ for several values of $\kappa$. For
$\kappa = 0$, $\alpha_c$ approaches 1 as $c \to -1$. This behaviour is in agreement with (13) for large
$m$. Fontanari and Meir (1989) considered the case $m = p$, so all patterns are pairwise
correlated. However, due to an error in one of the saddle point equations $\alpha_c(\kappa = 0)$
becomes less than 1 for values below $c \approx -0.7$ and reaches $\alpha_c = 0$ for $c \to -1$.
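The numerical step can be sketched as a one-dimensional root search. The code assumes the saddle-point condition in the form $-\hat r/c = (\kappa + \hat r) H(-\kappa - \hat r) + \exp\{-(\kappa + \hat r)^2/2\}/\sqrt{2\pi}$ and the capacity $\alpha_c^{-1} = H(-\kappa - \hat r) - \kappa \hat r / c$; this sign convention is an assumption, chosen because it reproduces the stated limits $\alpha_c = 2$ at $c = 0$ and $\alpha_c \to 1$ for $c \to -1$ at $\kappa = 0$. The bisection scheme and all function names are illustrative choices.

```python
import math

def H(x):
    """Gaussian tail integral, H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def mean_gap(u):
    """u*H(-u) + exp(-u^2/2)/sqrt(2 pi), with u = kappa + r_hat."""
    return u * H(-u) + math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def solve_r_hat(c, kappa):
    """Bisection for r_hat/c + mean_gap(kappa + r_hat) = 0 (c > -1, c != 0)."""
    if c <= -1.0 or c == 0.0:
        raise ValueError("requires c > -1 and c != 0")
    def g(r_hat):
        return r_hat / c + mean_gap(kappa + r_hat)
    lo, hi = -1.0, 1.0
    while g(lo) * g(hi) > 0.0:      # widen the bracket until the sign changes
        lo, hi = 2.0 * lo, 2.0 * hi
    lo_positive = g(lo) > 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (g(mid) > 0.0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def alpha_c(c, kappa=0.0):
    """Critical capacity as a function of c = R*m and the stability kappa."""
    if c == 0.0:                    # uncorrelated limit: r_hat = 0
        inv = H(-kappa) + kappa * mean_gap(kappa)
    else:
        r_hat = solve_r_hat(c, kappa)
        inv = H(-kappa - r_hat) - kappa * r_hat / c
    return 1.0 / inv

if __name__ == "__main__":
    for c in (-0.99, -0.5, 0.0, 1.0, 5.0):
        print(f"c = {c:+.2f}  alpha_c(kappa=0) = {alpha_c(c):.4f}")
```

For $c = 0$ the expression reduces to Gardner's result for uncorrelated patterns, which gives a quick consistency check of the implementation.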

For hierarchically correlated biased patterns the storage capacity has been calculated
by Engel (1990). Above certain values of the bias the critical capacity is less than two
and for certain ranges even less than one. It is not clear to us at which point general
position is violated in this case.

In summary, we have analyzed the behaviour of the storage capacity of a perceptron
for correlated patterns. We find that the storage capacity is lowered with respect to
uncorrelated patterns when different patterns (negative overlap) are mapped onto the
same output, but does not fall below one, in agreement with Cover's theorem. As a
consequence we suggest that the correlation matrix of the patterns should be analyzed
for problems which lead to a reduced storage capacity, as for example the bit sequence.

Future work should include the calculation of the stability of the replica symmetric
solution and the storage capacity for a binary perceptron. One could also investigate the
consequences for other architectures such as the committee or parity machine, although
we think that the results should be similar for these cases.

After completion of this work we have learned that a similar problem has been 
studied by Winkel (1995) using a different approach. 
Figure captions

Figure 1. The critical storage capacity $\alpha_c$ as a function of the overlap $R$ between two
correlated patterns for $\kappa = 0$. The dots with their corresponding error bars are results
from numerical simulations for systems with $N = 100$. Inset: The median learning
time to the power of $-1/2$ as a function of $\alpha$ for $R = -0.95$. The values are averaged
over 1000 samples, the line is a least squares fit to the data and the intersection between
the extrapolation and the x-axis gives the estimated value of $\alpha_c$.

Figure 2. The critical storage capacity $\alpha_c$ as a function of the correlation $c = Rm$
for various values of $\kappa = 0.0, 0.2, 1.0, 2.0$ (from top to bottom).